
FIGURE 2.1: Computation of a low-precision convolution or fully connected layer, as envisioned here.

This technique uses low-precision inputs, represented by $\bar{w}$ and $\bar{x}$, in the matrix multiplication units of convolutional and fully connected layers in deep learning networks. The low-precision integer matrix multiplication can be computed efficiently, and a step size then scales the output with a relatively low-cost, high-precision scalar-tensor multiplication. This scaling step can potentially be combined with other operations, such as batch normalization, through algebraic merging, as shown in Fig. 2.1. This approach minimizes the memory and computational costs associated with matrix multiplication.
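
As a rough sketch of this computation (NumPy, with illustrative bit widths, shapes, and step-size values that are not taken from the text), the integer matrix product is formed first and a single high-precision rescaling is applied afterwards:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-layer step sizes (fp32 scalars) and low-precision operands.
s_w, s_x = 0.05, 0.1
w_bar = rng.integers(-4, 4, size=(16, 64))   # quantized weights, e.g. signed 3-bit
x_bar = rng.integers(0, 8, size=(64, 32))    # quantized activations, e.g. unsigned 3-bit

acc = w_bar @ x_bar      # efficient low-precision integer matrix multiplication
y = (s_w * s_x) * acc    # relatively low-cost high-precision scalar-tensor rescaling
```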

2.2.2 Step Size Gradient

LSQ offers a way of learning $s$ based on the training loss by introducing the following gradient through the quantizer to the step size parameter:

\[
\frac{\partial \hat{v}}{\partial s} =
\begin{cases}
-v/s + \lfloor v/s \rceil, & \text{if } -Q_N < v/s < Q_P, \\
-Q_N, & \text{if } v/s \le -Q_N, \\
Q_P, & \text{if } v/s \ge Q_P.
\end{cases}
\tag{2.10}
\]
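
As a concrete illustration (hypothetical values: a signed 3-bit quantizer with $Q_N = 4$, $Q_P = 3$, and $s = 0.5$), the three cases of Eq. 2.10 evaluate as follows:

\[
\begin{aligned}
v = 1.3:&\quad v/s = 2.6 \in (-Q_N, Q_P) &&\Rightarrow\ \partial \hat{v}/\partial s = -2.6 + \lfloor 2.6 \rceil = 0.4,\\
v = 2.5:&\quad v/s = 5 \ge Q_P &&\Rightarrow\ \partial \hat{v}/\partial s = Q_P = 3,\\
v = -2.5:&\quad v/s = -5 \le -Q_N &&\Rightarrow\ \partial \hat{v}/\partial s = -Q_N = -4.
\end{aligned}
\]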

The gradient is calculated using the straight-through estimator proposed by [9], which approximates the gradient through the round function as a direct pass-through. The round operation itself is left in place for the purpose of differentiating downstream operations, while all other operations are differentiated conventionally.
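
A minimal PyTorch-style sketch of this straight-through round (an illustration, not the exact implementation used here): the rounded value is produced in the forward pass, while the backward pass treats the rounding as an identity because the correction term is detached from the autograd graph.

```python
import torch

def round_ste(v: torch.Tensor) -> torch.Tensor:
    # Forward: round to the nearest integer.
    # Backward: gradient passes straight through, as if no rounding occurred.
    return (v.round() - v).detach() + v
```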

The gradient calculated by LSQ differs from other, similar approximations (Fig. 2.2) in that it neither transforms the data before quantization nor estimates the gradient by algebraically canceling terms after removing the round operation from the forward equation, which results in $\partial \hat{v}/\partial s = 0$ when $-Q_N < v/s < Q_P$ [43].

In these previous methods, the proximity of $v$ to a transition point between quantized states does not affect the gradient of the quantization parameters. However, it is intuitive that the closer a value of $v$ is to a quantization transition point, the more likely a small change in $s$ is to push it into a different quantization bin, resulting in a large jump in $\hat{v}$. This means that $\partial \hat{v}/\partial s$ should increase as the distance from $v$ to a transition point decreases, as observed in the LSQ gradient.

Notably, this gradient emerges naturally from the simple quantizer formulation and the use

of the straight-through estimator for the round function.
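
To illustrate this point, the sketch below (PyTorch; it assumes the quantizer takes the form $\hat{v} = \lfloor \mathrm{clip}(v/s, -Q_N, Q_P) \rceil \cdot s$, and the variable names and values are purely illustrative) applies the straight-through round inside the quantizer and lets autograd produce the step size gradient, which matches the unclipped case of Eq. 2.10:

```python
import torch

def lsq_quantize(v: torch.Tensor, s: torch.Tensor, q_n: float, q_p: float) -> torch.Tensor:
    # Scale, clip to the quantization range, and round with a straight-through round.
    v_scaled = torch.clamp(v / s, -q_n, q_p)
    v_bar = (v_scaled.round() - v_scaled).detach() + v_scaled
    # Rescale back to the original range.
    return v_bar * s

s = torch.tensor(1.0, requires_grad=True)
v = torch.tensor(2.6)                        # v/s = 2.6 lies between two quantization bins
lsq_quantize(v, s, q_n=4.0, q_p=3.0).backward()
print(s.grad)                                # ≈ 0.4 = -v/s + round(v/s)
```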

In LSQ, each layer of weights and each layer of activations has its own step size, represented as a 32-bit floating-point value. These step sizes are initialized to $2\langle|v|\rangle/\sqrt{Q_P}$, computed from the initial weight values or the first batch of activations, respectively.
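
A minimal sketch of this initialization (PyTorch; the function name is illustrative, and it assumes $\langle|v|\rangle$ denotes the mean absolute value of the initial weights or of the first activation batch):

```python
import torch

def init_step_size(v: torch.Tensor, q_p: int) -> torch.nn.Parameter:
    # One fp32 scalar per layer of weights or activations, initialized
    # to 2<|v|>/sqrt(Q_P) from the data that s will quantize.
    s0 = 2.0 * v.abs().mean() / (q_p ** 0.5)
    return torch.nn.Parameter(s0)
```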